A Grid Based System for Data Mining Using MapReduce
Authors
Abstract
In this paper, we discuss a Grid data mining system based on the MapReduce computing paradigm. MapReduce automates fault tolerance and redundancy at the system level while keeping the programming model very simple for the user. It is built closely on top of a distributed file system, which allows efficient distributed storage of large data sets and allows computation to be scheduled close to the data. Many machine learning algorithms can be easily integrated into this environment. We explore the potential of the MapReduce paradigm for general large-scale data mining. We offer several modifications to the existing MapReduce scheduling system to bring it from a cluster environment to a campus grid that includes desktop PCs, servers, and clusters. We provide an example implementation of a machine learning algorithm (the Probabilistic Neural Network) in MapReduce form. We also discuss a MapReduce simulator that can be used to develop further enhancements to the MapReduce system. We provide simulation results for two newly proposed scheduling algorithms designed to improve MapReduce processing on the grid. These scheduling algorithms provide increased storage efficiency and increased job processing speed when used in a heterogeneous grid environment. This work will be used in the future to produce a fully functioning implementation of the MapReduce runtime system for a grid environment, enabling easy, data-intensive parallel computing for machine learning with little to no additional hardware investment.
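As a rough illustration of what a Probabilistic Neural Network in MapReduce form might look like, the sketch below splits the training data into shards, has each map call accumulate Gaussian kernel sums per (query, class) pair, and has the reduce step merge those partial sums into an average class density. This is not the authors' implementation: the function names (`pnn_map`, `pnn_reduce`, `classify`), the smoothing value `SIGMA`, and the in-memory shuffle are all assumptions made for illustration, and a real MapReduce runtime would handle the grouping and scheduling itself.

```python
import math
from collections import defaultdict

SIGMA = 0.5  # assumed smoothing parameter, chosen only for this sketch


def pnn_map(shard, queries, sigma=SIGMA):
    """Map step: for one training shard, emit partial kernel sums.

    Each output is ((query_id, label), (kernel_sum, sample_count)).
    """
    partial = defaultdict(lambda: [0.0, 0])
    for features, label in shard:
        for qid, query in enumerate(queries):
            dist2 = sum((a - b) ** 2 for a, b in zip(features, query))
            acc = partial[(qid, label)]
            acc[0] += math.exp(-dist2 / (2.0 * sigma ** 2))
            acc[1] += 1
    return [(key, (s, n)) for key, (s, n) in partial.items()]


def pnn_reduce(key, values):
    """Reduce step: merge partial sums into an average class density."""
    total = sum(s for s, _ in values)
    count = sum(n for _, n in values)
    return key, total / count


def classify(shards, queries, sigma=SIGMA):
    """Run map over every shard, group by key, reduce, pick the best class."""
    grouped = defaultdict(list)  # stands in for the framework's shuffle phase
    for shard in shards:
        for key, value in pnn_map(shard, queries, sigma):
            grouped[key].append(value)

    best = {}
    for key, values in grouped.items():
        (qid, label), density = pnn_reduce(key, values)
        if qid not in best or density > best[qid][1]:
            best[qid] = (label, density)
    return [best[qid][0] for qid in range(len(queries))]
```

Because each map call touches only its own shard and the reduce step only sums, the shards can be processed on different machines in any order, which is the property that makes this kind of algorithm a natural fit for the paradigm described above.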
Similar resources
Review of Apriori Based Algorithms on MapReduce Framework
The Apriori algorithm, which mines frequent itemsets, is one of the most popular and widely used data mining algorithms. Nowadays, many algorithms have been proposed on parallel and distributed platforms to enhance the performance of the Apriori algorithm. They differ from each other in the load balancing technique, memory system, data decomposition technique, and data layout used to implement ...
A New High Frequency Grid Impedance Estimation Technique for the Frequency Range of 2 to 150 kHz
Grid impedance estimation is used in many power system applications, such as grid-connected renewable energy systems and power quality analysis of smart grids. Grid impedance estimation techniques based on signal injection use Ohm's law for the estimation. In these methods, one or more signals are injected at the Point of Common Coupling (PCC). Then the current through and voltage of ...
Grid Impedance Estimation Using Several Short-Term Low Power Signal Injections
In this paper, a signal processing method is proposed to estimate the low- and high-frequency impedances of power systems using several short-term low power signal injections for a frequency range of 0-150 kHz. This frequency range is very important, and thus it is considered in the analysis of power quality issues of smart grids. The impedance estimation is used in many power system applicati...
MapReduce K-Means based Co-Clustering Approach for Web Page Recommendation System
Co-clustering is one of the data mining techniques used for web usage mining. Co-clustering Web log data is the process of simultaneously categorizing both users and pages. It is used to extract users' information based on a subset of pages. Nowadays, cyberspace is filled with a huge volume of data distributed across the world. The business knowledge acquaintance from such a voluminous d...
A Survey on Parallel Method for Rough Set using MapReduce Technique for Data Mining
This paper presents a survey on data mining, data mining using rough set theory, and data mining using a parallel method for rough set approximation with the MapReduce technique. With the development of information technology, data is growing at a tremendous rate, so big data mining and knowledge discovery have become a new challenge. Rough set theory has been successfully applied in data mining by using MapRe...